Voxtral Realtime: enable bf16 for Metal backend with quantization #17845

mergennachin wants to merge 1 commit into `main`
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17845

❌ As of commit 52027ff with merge base 6db7f4c: 2 new failures, 1 cancelled job, 4 unrelated failures.

- NEW FAILURES: the following jobs have failed.
- CANCELLED JOB: the following job was cancelled; please retry.
- FLAKY: the following job failed, but likely due to flakiness present on trunk.
- BROKEN TRUNK: the following jobs failed, but the failures were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI.
Pull request overview

Enables and recommends bf16 for Voxtral Realtime exports on Metal when using quantization, updating CI export arguments and user-facing docs to reflect the preferred configuration for memory and throughput.

Changes:

- Update the Voxtral Realtime docs to include bf16 memory footprint numbers and recommend `--dtype bf16` for Metal quantized exports.
- Adjust the example Metal export command(s) to include `--dtype bf16` alongside `fpa4w`.
- Update the Metal CI export script to pass `--dtype bf16` for the `quantized-int4-metal` configuration.
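The docs changes above add bf16 memory footprint numbers. The arithmetic behind such numbers can be sketched as below; note that the parameter count used here is purely illustrative, not a figure from the Voxtral Realtime docs, and int4 packing overheads (scales, zero points) are ignored.

```python
# Per-element storage cost for the dtypes discussed in this PR.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "int4": 0.5}

def weight_bytes(num_params: int, dtype: str) -> float:
    """Raw weight storage for `num_params` parameters in the given dtype."""
    return num_params * BYTES_PER_ELEMENT[dtype]

params = 4_700_000_000  # illustrative parameter count, not from the docs
gib = 1024 ** 3
for dtype in ("fp32", "bf16", "int4"):
    print(f"{dtype}: {weight_bytes(params, dtype) / gib:.1f} GiB")
```

This is why bf16 halves the unquantized footprint relative to fp32, and why it pairs well with fpa4w quantization for activations and other non-quantized tensors.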
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| examples/models/voxtral_realtime/model.md | Updates memory calculations and guidance around bf16 + quantization for Metal/CUDA. |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Updates usage example to show Metal export with bf16 + fpa4w. |
| examples/models/voxtral_realtime/README.md | Updates Metal backend table and export examples to recommend bf16 with fpa4w. |
| .ci/scripts/export_model_artifact.sh | Ensures the Metal int4 quantized CI export passes `--dtype bf16`. |
The Metal AOTI backend already handles bf16 correctly (fp32 attention masks, fp32 RoPE upcast, dtype-agnostic KV caches and SDPA). Enable `--dtype bf16` as the default recipe for Metal CI and update all documentation to recommend bf16 with fpa4w quantization. Also fix a Metal shader compilation bug in the streaming encoder where `bool.to(bf16)` generates `bfloat tmp = 0.0;`: Metal Shading Language doesn't support implicit float-to-bfloat literal conversion. Use `.float()` instead and let `mul_` handle the type promotion.
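The shader fix described in the commit message can be illustrated with a minimal PyTorch sketch. The function names here are hypothetical, and the actual streaming-encoder code may differ; this only shows the cast pattern the commit describes.

```python
import torch

def apply_mask_buggy(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Works in eager mode, but a bool -> bf16 cast can lower to a Metal
    # shader containing `bfloat tmp = 0.0;`, which MSL rejects because it
    # has no implicit float-to-bfloat literal conversion.
    return scores * mask.to(scores.dtype)

def apply_mask_fixed(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Cast the bool mask to fp32 instead, and multiply in place so that
    # type promotion keeps the result in the scores' dtype (bf16).
    return scores.mul_(mask.float())
```

Both variants compute the same values in eager mode; only the lowered Metal shader differs.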
Force-pushed from 40b6144 to 52027ff.